In the following, we prove that the Hessian matrix of the loss function is directly related to the expectation of the covariance of the gradient. Taking the loss function as the negative logarithm of the likelihood, let X be a set of input data to the network and $p(X; \hat{w}, \hat{\alpha})$ be the predicted distribution on X given the network parameters $\hat{w}$ and $\hat{\alpha}$, i.e., the output logits of the head layer.
Omitting $\hat{w}$ for simplicity, the Fisher information on the set of probability distributions $\mathcal{P} = \{p_n(X; \hat{\alpha}),\, n \in \{1,\dots,N\}\}$ can be described by a matrix whose entry in the $i$-th row and $j$-th column is
$$
I_{i,j}(\hat{\alpha}) = \mathbb{E}_X\left[\frac{\partial \log p_n(X; \hat{\alpha})}{\partial \hat{\alpha}_i}\,\frac{\partial \log p_n(X; \hat{\alpha})}{\partial \hat{\alpha}_j}\right]. \tag{4.36}
$$
Recall that N denotes the number of classes described in Eq. 4.21. It is then straightforward to show that the Fisher information of the probability distribution set $\mathcal{P}$ equals the negative expected Hessian of the log-likelihood, i.e.,
$$
I_{i,j}(\hat{\alpha}) = -\mathbb{E}_X\left[\frac{\partial^2 \log p_n(X; \hat{\alpha})}{\partial \hat{\alpha}_i \partial \hat{\alpha}_j}\right]. \tag{4.37}
$$
Let $H_{i,j}$ denote the second-order partial derivative operator $\frac{\partial^2}{\partial \hat{\alpha}_i \partial \hat{\alpha}_j}$. Note that the first derivative of the log-likelihood is
$$
\frac{\partial \log p_n(X; \hat{\alpha})}{\partial \hat{\alpha}_i} = \frac{\partial p_n(X; \hat{\alpha})}{p_n(X; \hat{\alpha})\,\partial \hat{\alpha}_i}. \tag{4.38}
$$
The second derivative is
$$
H_{i,j}\log p_n(X; \hat{\alpha}) = \frac{H_{i,j}\,p_n(X; \hat{\alpha})}{p_n(X; \hat{\alpha})} - \frac{\partial p_n(X; \hat{\alpha})}{p_n(X; \hat{\alpha})\,\partial \hat{\alpha}_i}\,\frac{\partial p_n(X; \hat{\alpha})}{p_n(X; \hat{\alpha})\,\partial \hat{\alpha}_j}. \tag{4.39}
$$
Considering that
$$
\mathbb{E}_X\left(\frac{H_{i,j}\,p_n(X; \hat{\alpha})}{p_n(X; \hat{\alpha})}\right) = \int \frac{H_{i,j}\,p_n(X; \hat{\alpha})}{p_n(X; \hat{\alpha})}\,p_n(X; \hat{\alpha})\,dX = H_{i,j}\int p_n(X; \hat{\alpha})\,dX = 0, \tag{4.40}
$$
we take the expectation of the second derivative and obtain
$$
\begin{aligned}
\mathbb{E}_X\big(H_{i,j}\log p_n(X; \hat{\alpha})\big) &= -\mathbb{E}_X\left\{\frac{\partial p_n(X; \hat{\alpha})}{p_n(X; \hat{\alpha})\,\partial \hat{\alpha}_i}\,\frac{\partial p_n(X; \hat{\alpha})}{p_n(X; \hat{\alpha})\,\partial \hat{\alpha}_j}\right\} \\
&= -\mathbb{E}_X\left\{\frac{\partial \log p_n(X; \hat{\alpha})}{\partial \hat{\alpha}_i}\,\frac{\partial \log p_n(X; \hat{\alpha})}{\partial \hat{\alpha}_j}\right\} = -I_{i,j}(\hat{\alpha}).
\end{aligned} \tag{4.41}
$$
Thus, an equivalent substitution for the Hessian matrix $H_{\tilde{f}_b}(\hat{\alpha})$ in Eq. 4.32 is the expectation of the product of two first-order derivatives. This concludes the proof that the covariance of gradients can be used to represent the Hessian matrix for efficient computation.
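As a concrete illustration, the snippet below is a minimal PyTorch sketch (not the authors' implementation) of this substitution: it estimates the Hessian surrogate as the empirical Fisher matrix, i.e., the average outer product of per-sample gradients of the log-likelihood with respect to the architecture parameters. The names `empirical_fisher`, `model`, `alpha`, `inputs`, and `targets` are hypothetical, and `model(x)` is assumed to return logits that depend on `alpha`.

```python
# Minimal sketch: approximate the Hessian w.r.t. the architecture parameters by
# the empirical Fisher matrix E_X[g g^T], g = d log p_n(X; alpha) / d alpha.
import torch

def empirical_fisher(model, alpha, inputs, targets):
    """`alpha` is a flat tensor with requires_grad=True that enters model(x)."""
    d = alpha.numel()
    fisher = torch.zeros(d, d)
    for x, y in zip(inputs, targets):
        logits = model(x.unsqueeze(0))                      # head-layer logits
        log_prob = torch.log_softmax(logits, dim=-1)[0, y]  # log p_n(X; alpha)
        (g,) = torch.autograd.grad(log_prob, alpha)         # gradient w.r.t. alpha
        g = g.reshape(-1)
        fisher += torch.outer(g, g)                         # accumulate g g^T
    return fisher / len(inputs)                             # expectation over X
```

This only requires first-order backpropagation per sample, which is the efficiency gain the proof above justifies.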
4.4.6 Decoupled Optimization for Training the DCP-NAS
In this section, we first describe the coupling relationship between the weights and the architecture parameters in DCP-NAS. We then present a decoupled optimization applied during backpropagation of the sampled supernet to fully and effectively optimize these two coupled sets of parameters.
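For intuition before the derivation, the sketch below shows a naive alternating-update baseline for two coupled parameter groups. It is not the decoupled scheme developed in this section; `weight_loss` and `arch_loss` are placeholder callables for the respective objectives, and `w_opt`/`a_opt` are assumed to be optimizers over the weight and architecture parameters.

```python
# Naive alternating updates for coupled parameters (illustrative baseline only;
# the decoupled backpropagation scheme of DCP-NAS is derived in what follows).
import torch

def alternate_step(weight_loss, arch_loss, w_opt, a_opt):
    # 1) update the weights with the architecture parameters held fixed
    w_opt.zero_grad()
    weight_loss().backward()
    w_opt.step()

    # 2) update the architecture parameters with the weights held fixed
    a_opt.zero_grad()
    arch_loss().backward()
    a_opt.step()
```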
Coupled models for DCP-NAS. Combining Eq. 4.27 and Eq. 4.28, we first show how the parameters in DCP-NAS are formulated in a coupling relationship as